1 Introduction

The purpose of this project is to create a machine learning model that can predict the price of a diamond based on its charateristics. This report will start with an initial exploratory data analysis (EDA). The EDA will be filled with data visualizations and inital Q&As to help understand a data set. The EDA is then followed by a model exploration. Different machine learning alogrithms will be used to create predictive models. Models will then be compared against each other and the most accurate predictive model will be determined. The conclusion section will provide an executive summary of data insights and the selected predictive model.

2 EDA

This inital exploratory analysis will attempt to answer the following questions:

  1. First Look: What does the data look like? What are some initial descriptive statistics and insights.

  2. Best of the Best: Imagine someone wants to buy the best possible diamond, and money is no object. They only want to consider diamonds in the top categories of cut (Ideal), color (D), and clarity (IF). They want the most ideal range for depth (59-63) and table (54-57). Within the dataset, if we plot carat versus price can we fit a clean trendline? Is it linear? Exponential? What’s the price of the largest carat, and is it the most expensive?

  3. Depth and Table Percentages: I found the ideal depth and table values mentioned above online, but let’s explore the dataset a little. if we fix the 4 C’s (carat, cut, color, and clarity), how much do depth and table impact price? If we widen the ranges slightly, can we save a substantial amount?

  4. Best Bang for the Buck: Imagine someone wants to find the diamond which maximizes cut, color, and clarity per dollar. Using the expanded depth and table values from question 2 above, when does price start to increase exponentially for cut? What about for color? And clarity?

  5. Bigger is Better: Imagine a guy named Bob who wants to buy a pair of diamonds for his wife, and have them made into earrings for her birthday. In Bob’s mind, size (carat) is all that matters. He has $3200. He needs Two diamonds with the exact same cut, color, and clarity (with very comparable depth and table values), and he wants them to be as big as possible. What size carat can he afford? If he adjusts his budget, how much does the “maximum carat” size shift? Can we plot that and fit a line to it to find the “knee in the curve”?

2.1 First Look

This data set is provided in the ggplot2 package from CRAN. Let’s load the tidyverse collection of packages, which includes ggplot2 and some other packages (like dplyr) which are useful for data manipulation. Let’s also load the plotly package, which allows us to create advanced visualizations.

# Load libraries needed for EDA
library(tidyverse)
library(plotly)

# save diamonds data into an object called data
data <- ggplot2::diamonds

A sample of the data is provided below:

head(data)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Looks like there is a mix of numeric and categorical data. The numeric data (carat, depth, table, x, y, & z) are doubles (dbl), which means they are numerics with decimal places. Price is a numeric integer, so it will not have any decimal places. The categorical data (cut, color, and clarity) are ordered (ord), which means it has a specific order, such as best to worst. Similar to the sample above, the glimpse function provides a sample of the data in a different format that is more preferable when a data set is very large.

glimpse(data)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

There are 53940 rows of data. Each row is one observation / one diamond. There are 10 columns of data. Each column is a variable / feature. Below is a quick statistical summary of each feature:

summary(data)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

A minimum, 1st quartile, median, mean, 3rd quartile, and maximum is provided for all the numeric data. For example, the median price of diamonds in this data set is 2401. A data count is provided for all categorical data. For example, there are 21551 diamonds that are Ideal cut in this data set. Since all the categorical data is ordered, the counts are in its pre-specified order. For the example, the order for color is: D, E, F, G, H, I, J. Below is a definition for each feature:

carat: weight of the diamond (0.2 to 5.01)
cut: quality of the cut (in order of worst to best; Fair, Good, Very Good, Premium, Ideal)
color: diamond color (in order of best to worst; D, E, F, G, H, I, J)
clarity: a measurement of how clear the diamond is (in order of worst to best; I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)
depth: total depth percentage (total height divided by total width) = 2 * z / (x + y) (43 to 79)
table: width of the top of a diamond relative to its widest point (43 to 95)
price: price in US dollars (326 to 18823)
x: length in mm (0 to 10.74)
y: width in mm (0 to 58.9)
z: depth in mm (0 to 31.8)

My OCD is kicking in a little bit for color. It is in the order of best to worst, while cut and clarity are in order of worst to best. The script below reorders color from worst to best for consistency. This will have no effect what-so-ever on the analysis, it’s just my personal preference.

# Reordering color so it's from worst to best. Just like all the other ordered features.
# This is personal preference. It's not a necessary step.
data <- data %>%                  # Create a new object, data, from the old object, data
  mutate(                         # Change its columns
    color = ordered(color,        # Make the color column and ordered column
                    levels = c(   # The order/levels are as follows:
                      "J",
                      "I",
                      "H",
                      "G",
                      "F",
                      "E",
                      "D"
                    )))

Now we can get to the interesting stuff. Let’s look at how price is affected by each individual feature.

2.1.1 Price vs Carat

Below is a scatter plot of price vs carat:

ggplot(data, aes(x=carat, y=price)) + # Plot the object data, with carat mapped to the x axis and price mapped to the y axis
  geom_point(alpha=0.1)               # Make the plot a scatter plot. Make each point 10% transparent/alpha


Price increases at an increasing rate as carat increases.

2.1.2 Price vs Cut

Below is a box-plot of price by cut.

{ggplot(data, aes(x=cut, y=price)) + # Plot the object data, with cut mapped to the x axis and price mapped to the y axis
  geom_boxplot()} %>%                # Make the plot a box plot
  ggplotly()                         # Make the plot interactive with the ggplotly function


There is no immediately visible trend. I would expect price to increase as the cut becomes better. However, other features may be affecting this box-plot. For example, maybe the average carat size for ideal diamonds is lower than the average carat size for fair. Therefore, the average price for ideal cuts in this data set are lower than fair cuts. This will be explored in more depth later on in this analysis. For now, we are only going to look at price vs each individual feature.

2.1.3 Price vs Color

Below is a box-plot of price by color.

{ggplot(data, aes(x=color, y=price))+
  geom_boxplot()} %>% 
  ggplotly()


Very interesting! As color gets better, price decreases. This absolutely does not pass the common sense test. Similar to cut, other features (like carat) must be affecting these results.

2.1.4 Price vs Clarity

Below is a box-plot of price by clarity.

{ggplot(data, aes(x=clarity, y=price))+
  geom_boxplot()} %>% 
  ggplotly()


These are similar results to color. As clarity gets better, price decreases. Once again, this does not pass the common sense test.

2.1.5 Price vs Depth Percentage

Below is a scatter plot of price vs depth percentage.

ggplot(data, aes(x=depth, y=price))+
  geom_point(alpha=0.1)


It looks like as the range of depth narrows, the price increases. It appears more expensive diamonds are roughly around 58% to 63% depth percentage.

2.1.6 Price vs Table Percentage

Below is a scatter plot of price vs table percentage.

ggplot(data, aes(x=table, y=price))+
  geom_point(alpha=0.1)


This looks similar to depth percentage. As the range of table percentage narrows, the price increases. Although, it doesn’t to appear to be as drastic as depth percentage.

2.1.7 Price vs Length (x)

Below is a scatter plot of price vs length (x).

ggplot(data, aes(x=x, y=price))+
  geom_point(alpha=0.1)


This looks similar to carat. Price increases at an increasing rate as length increases. There’s also some data points that have a length of zero. These must be missing data points and should be converted to NAs. If they are not, this can mess up some predictive models down the line. First let’s find out how many zeros there are.

df <- data %>%     # Create a new object df from the object data
  filter(x==0)     # show only the data that has a zero value for x

print(df)          # Display the object df
## # A tibble: 8 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1.07 Ideal     F     SI2      61.6    56  4954     0  6.62     0
## 2  1    Very Good H     VS2      63.3    53  5139     0  0        0
## 3  1.14 Fair      G     VS1      57.5    67  6381     0  0        0
## 4  1.56 Ideal     G     VS2      62.2    54 12800     0  0        0
## 5  1.2  Premium   D     VVS1     62.1    59 15686     0  0        0
## 6  2.25 Premium   H     SI2      62.8    59 18034     0  0        0
## 7  0.71 Good      F     SI2      64.1    60  2130     0  0        0
## 8  0.71 Good      F     SI2      64.1    60  2130     0  0        0

There are 8 rows of data with x labeled as zero and there are 53940 rows of data total. This is a small percent (1.483129410^{-4}) of the total data, but this should still be addressed. These values will be converted into NAs and what to do with them will be determined during model exploration. These rows of data could be removed from the data set completely, or a median length can be assumed, or potentially x can be calculated from depth and table percentage. Is it even worth addressing it all? This will be determined later in this project.

I also see zeros for y and z. The script below determines how many zeros there are for width (y):

df <- data %>% 
  filter(y==0)

nrow(df)   # display how many rows are in object df
## [1] 7

And for depth (z):

df <- data %>% 
  filter(z==0)

nrow(df)
## [1] 20

There are 20 missing data points for z. That is 3.707823510^{-4} percent of the data. It’s slightly more that x and y, but still a small sample. The script below will convert all zeros for x, y, and z into NAs, specifically numeric NAs.

data <- data %>%          # create a new object called data, from the old object called data
  mutate(                 # change the columns in the old object
    x = case_when(        # Change the column x by the following conditions
      x==0 ~ NA_real_,    # When x is 0, make it a NA
      TRUE ~ x),          # For everything else, keep x as is
    y = case_when(        # same logic as x
      y==0 ~ NA_real_,   
      TRUE ~ y),   
    z = case_when(        # same logic as x
      z==0 ~ NA_real_,
      TRUE ~ z))

All zeros are now NAs.

2.1.8 Price vs Width (y)

Below is a scatter plot of price vs width (y).

ggplot(data, aes(x=y, y=price))+
  geom_point(alpha=0.1)


First off, there are some outliers past a y of 20. Let’s see what these points are.

df <- data %>% 
  filter(y>20)

print(df)
## # A tibble: 2 x 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  2    Premium H     SI2      58.9    57 12210  8.09  58.9  8.06
## 2  0.51 Ideal   E     VS1      61.8    55  2075  5.15  31.8  5.12

These values for y can’t be right. One way to double check is to use the equation for depth percentage: depth = 2 * z/(x + y). Using a little algebra we can solve for y: y = (2z/depth) - x. Before using these equations, we need a proof of concept. Depth and y will be calculated using these equations and we’ll check if they equal the depth and y provided in the data. If they do, the equations are true.

df <- data %>%
  select(depth, table, x, y, z) %>%   # Select only these columns from data
  mutate(
    depth_calc = round(((2*z)/(x+y))*100, # create a new column, depth_calc using the equation
                       digits = 1),       # round to the nearest one digit
    y_calc = round((2*z/(depth/100))-x,   # create a new column, y_calc using the equation
                   digits = 2),           # round to the nearest two digits
    depth_check = depth==depth_calc,      # create depth_check, which is true if depth equals depth_calc
    y_check = y==y_calc) %>%              # create y_check, which is true if y equals y_calc
    select(depth, depth_calc, depth_check, y, y_calc, y_check) # Select only these columns
  
head(df, n = 20) # show the first 20 rows
## # A tibble: 20 x 6
##    depth depth_calc depth_check     y y_calc y_check
##    <dbl>      <dbl> <lgl>       <dbl>  <dbl> <lgl>  
##  1  61.5       61.3 FALSE        3.98   3.95 FALSE  
##  2  59.8       59.8 TRUE         3.84   3.84 TRUE   
##  3  56.9       56.9 TRUE         4.07   4.07 TRUE   
##  4  62.4       62.4 TRUE         4.23   4.23 TRUE   
##  5  63.3       63.3 TRUE         4.35   4.35 TRUE   
##  6  62.8       62.8 TRUE         3.96   3.96 TRUE   
##  7  62.3       62.3 TRUE         3.98   3.98 TRUE   
##  8  61.9       61.9 TRUE         4.11   4.1  FALSE  
##  9  65.1       65.1 TRUE         3.78   3.78 TRUE   
## 10  59.4       59.4 TRUE         4.05   4.05 TRUE   
## 11  64         64   TRUE         4.28   4.28 TRUE   
## 12  62.8       62.8 TRUE         3.9    3.9  TRUE   
## 13  60.4       60.4 TRUE         3.84   3.84 TRUE   
## 14  62.2       62.2 TRUE         4.37   4.36 FALSE  
## 15  60.2       60.2 TRUE         3.75   3.75 TRUE   
## 16  60.9       60.9 TRUE         4.42   4.42 TRUE   
## 17  62         62   TRUE         4.34   4.34 TRUE   
## 18  63.4       63.4 TRUE         4.29   4.29 TRUE   
## 19  63.8       63.8 TRUE         4.26   4.27 FALSE  
## 20  62.7       62.7 TRUE         4.27   4.27 TRUE

It looks like the equations work for the most part. In some instances they don’t match 100%, but they are in the ballpark. This may be due to rounding errors. Now that these equations are confirmed. Let’s calculate y for the two outliers.

df <- data %>% 
  filter(y>20) %>% 
  mutate(y_calc = round((2*z/(depth/100))-x,
                   digits = 2))

print(df)
## # A tibble: 2 x 11
##   carat cut     color clarity depth table price     x     y     z y_calc
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
## 1  2    Premium H     SI2      58.9    57 12210  8.09  58.9  8.06   19.3
## 2  0.51 Ideal   E     VS1      61.8    55  2075  5.15  31.8  5.12   11.4

These values seem a lot more reasonable. Now let’s commit this change to the data and re-plot.

data <- data %>% 
  mutate(
    y = case_when(
      y>20 ~ round((2*z/(depth/100))-x,digits = 2),
      TRUE ~ y))
ggplot(data, aes(x=y, y=price))+
  geom_point(alpha=0.1)


Those data points still look like they may be outliers but they are more reasonable than before, and trend is easier to see in this chart. Price increases at an increasing rate as y increases.

2.1.9 Price vs Depth (z)

Below is a scatter plot of price vs depth (z).

ggplot(data, aes(x=z, y=price))+
  geom_point(alpha=0.1)


Looks like there is an outlier past a depth (z) of 30. This may be a similar case as seen with width (y). Below shows this data point:

df <- data %>% 
  filter(z>30)

print(df)
## # A tibble: 1 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.51 Very Good E     VS1      61.8  54.7  1970  5.12  5.15  31.8

This doesn’t pass the common sense test. Using the depth percentage equation we can calcuate depth (z) by using: z = depth*(x+y)/2. The results are below:

df <- df %>% 
  mutate(
    z_calc = round(((depth/100)*(x+y))/2, digits = 2)
  )

print(df)
## # A tibble: 1 x 11
##   carat cut       color clarity depth table price     x     y     z z_calc
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
## 1  0.51 Very Good E     VS1      61.8  54.7  1970  5.12  5.15  31.8   3.17

The calculated z looks more reasonable. The script below commits this change to the data and re-plots.

data <- data %>% 
  mutate(
    z = case_when(
      z>30 ~ round(((depth/100)*(x+y))/2, digits = 2),
      TRUE ~ z
    )
  )
ggplot(data, aes(x=z, y=price))+
  geom_point(alpha=0.1)


That’s better! This trend appears to be the same as carat, x, and y. Now, onto some more detailed data exploration.

2.2 Best of the Best

Best of the Best: Imagine someone wants to buy the best possible diamond, and money is no object. They only want to consider diamonds in the top categories of cut (Ideal), color (D), and clarity (IF). They want the most ideal range for depth (59-63) and table (54-57). Within the dataset, if we plot carat versus price can we fit a clean trendline? Is it linear? Exponential? What’s the price of the largest carat, and is it the most expensive?

First we need to filter the data set to only the best cut, color, clarity, depth, and table.

df <- data %>% 
  filter(
    cut=="Ideal",
    color=="D",
    clarity=="IF",
    between(depth, 59, 63),
    between(table, 54, 57))

print(df)
## # A tibble: 24 x 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.51 Ideal D     IF       62      56  3446  5.14  5.18  3.2 
##  2  0.51 Ideal D     IF       62.1    55  3446  5.12  5.13  3.19
##  3  0.53 Ideal D     IF       61.5    54  3517  5.27  5.21  3.22
##  4  0.53 Ideal D     IF       62.2    55  3812  5.17  5.19  3.22
##  5  0.59 Ideal D     IF       60.9    57  4208  5.4   5.43  3.3 
##  6  0.56 Ideal D     IF       62.4    56  4216  5.24  5.28  3.28
##  7  0.56 Ideal D     IF       61.9    57  4293  5.28  5.31  3.28
##  8  0.63 Ideal D     IF       62.5    55  6549  5.47  5.5   3.43
##  9  0.63 Ideal D     IF       62.5    55  6607  5.5   5.47  3.43
## 10  1.04 Ideal D     IF       61.8    57 14494  6.49  6.52  4.02
## # ... with 14 more rows


This filters the data to only 24 options. Let’s plot carat versus price.

ggplot(df, aes(x=carat, y=price)) +
  geom_point()


Looks like a linear model could be viable. Let’s create one.

model_df_lm <- lm(price ~ carat, df)

summary(model_df_lm)
## 
## Call:
## lm(formula = price ~ carat, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2123.5 -1264.7   245.6  1026.5  2344.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5621.9      623.1  -9.023 7.57e-09 ***
## carat        20259.9      911.8  22.221  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1288 on 22 degrees of freedom
## Multiple R-squared:  0.9573, Adjusted R-squared:  0.9554 
## F-statistic: 493.8 on 1 and 22 DF,  p-value: < 2.2e-16


Carat passes the T test (p-value) by a mile and the R squared isn’t too bad. Let’s plot this model now.

pred_df_lm <- predict(model_df_lm, df)

df <- df %>% 
  mutate(pred = pred_df_lm)

ggplot(df, aes(x=carat, y=price)) +
  geom_point(color="blue") +
  geom_line(color="red", aes(y=pred))


Looks like the most expensive diamond might not be the largest one. Let’s find these two data points.

df2 <- df %>% 
  filter(
    carat == max(carat) |
      price == max(price)
  )

print(df2)
## # A tibble: 2 x 11
##   carat cut   color clarity depth table price     x     y     z   pred
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
## 1  1.07 Ideal D     IF       60.9    54 17042  6.66  6.73  4.08 16056.
## 2  1.03 Ideal D     IF       62      56 17590  6.55  6.44  4.03 15246.


Interesting! The most expensive diamond is the not the largest diamond, but it has a larger depth and table compared to the largest carat diamond. This needs a closer look.

2.3 Depth and Table

Depth and Table Percentages: I found the ideal depth and table values mentioned above online, but let’s explore the dataset a little. if we fix the 4 C’s (carat, cut, color, and clarity), how much do depth and table impact price? If we widen the ranges slightly, can we save a substantial amount?

First let’s create a scatter plot of price vs carat.

ggplot(data, aes(x=carat, y=price)) +
  geom_point()


Next, let’s create a scatter plot of price vs depth.

ggplot(data, aes(x=depth, y=price)) +
  geom_point()


Now let’s create a scatter plot of price vs table.

ggplot(data, aes(x=table, y=price)) +
  geom_point()

2.3.1 Depth Deep Dive

A 3D scatter plot of price vs carat vs depth.

plot_ly(data=data, x=~price, y=~carat, z=~depth, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))


A 3D scatter plot of price vs carat vs depth with cut mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~cut, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))


A 3D scatter plot of price vs carat vs depth with color mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~color, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))


A 3D scatter plot of price vs carat vs depth with clarity mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~clarity, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="depth")))

2.3.2 Table Deep Dive

A 3D scatter plot of price vs carat vs table.

plot_ly(data=data, x=~price, y=~carat, z=~table, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))


A 3D scatter plot of price vs carat vs table with cut mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~table, color=~cut, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="table")))


A 3D scatter plot of price vs carat vs table with color mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~table, color=~color, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="table")))


A 3D scatter plot of price vs carat vs table with clarity mapped to color.

plot_ly(data=data, x=~price, y=~carat, z=~table, color=~clarity, type = "scatter3d", marker = list(size=1)) %>% 
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="table")))


2.4 Best Bang for the Buck

Best Bang for the Buck: Imagine someone wants to find the diamond which maximizes cut, color, and clarity per dollar. Using the expanded depth and table values from question 2 above, when does price start to increase exponentially for cut? What about for color? And clarity?

Under development.

2.5 Bigger is Better

  1. Bigger is Better: Imagine a guy named Bob who wants to buy a pair of diamonds for his wife, and have them made into earrings for her birthday. In Bob’s mind, size (carat) is all that matters. He has $3200. He needs Two diamonds with the exact same cut, color, and clarity (with very comparable depth and table values), and he wants them to be as big as possible. What size carat can he afford? If he adjusts his budget, how much does the “maximum carat” size shift? Can we plot that and fit a line to it to find the “knee in the curve”?

Under development.

3 Model Exploration

3.1 Linear Regression

Under development.

3.2 Random Forest

Under development.

3.3 Model Comparison

Under development.

4 Discussion

Throughout this analysis some assumptions and data cleaning steps have been made. These decisions will impact how each predictive model will perform. A sensitivity analysis is recommended for the following data cleaning issues:

1. Issue: x, y, and z values labeled as zeros.
Action took in this analysis: Make all zeros NAs and ignore NAs in model exploration.

2. Issue: Potential outliers for y and z.
Action took in this analysis: Estimate y and z using the depth equation

5 Conclusion